Tools for Efficient Research Workflows

Lukas Lehner

2024-11-14

Part II: GitHub

Why GitHub? The analogy of climbing

Source: illustrations by @allison_horst

Git

  • Who has used Git previously? What for?
  • Who has used any kind of version control previously?

Git as Version Control

You all have used version control previously:

  • “Save early, save often”.
  • Easiest version control: the back-button.
  • “Track-changes” in MS Word is a rudimentary form of version control.

 

Git is a sophisticated form of version control. Git…

  • … maintains a single, updated version of each file.
  • … keeps a record of all previous versions.
  • … keeps a record of exact changes between the versions.
  • … lets collaborators work simultaneously.
  • … documents who made changes, when and why.

Why should I learn yet another tool?

Why should I learn yet another tool? Git as Version Control

  • Maintain an overview
  • Access previous versions

 

  • Strengthen Collaboration
  • Foster Transparency

Git: a preview

Git: Terminology (1): pillars

  • repository (repo)
  • commit
  • diff

 

  • branch
  • remote
  • local
  • commit message and tag
  • gist
  • README

Git: Terminology (1): pillars

  • repository (repo): directory of files
  • commit: snapshot of directory
  • diff: difference between two commits

 

  • branch: detour from main stream without changing main stream
  • remote: repo hosted online
  • local: repo on your hard drive (offline)
  • commit message and tag: notes assigned to commits
  • gist: small repo to share one code file
  • README: “About me” section of your repository or your GitHub profile
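
The pillars above can be tried out in a throwaway directory. The following is a minimal sketch, not a prescribed workflow: the file name `analysis.R`, the tag `v0.1`, the branch name `robustness-checks` and the user identity are all made up for illustration.

```shell
set -eu
# Work in a temporary directory so nothing on your machine is touched.
workdir=$(mktemp -d)
cd "$workdir"

# repository: turn the directory into a Git repo.
git init -q
git symbolic-ref HEAD refs/heads/main          # name the first branch "main"
git config user.name "Workshop Participant"    # hypothetical identity
git config user.email "participant@example.com"

# commit: take a first snapshot of the directory.
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# A second commit after changing the file.
echo "y <- 2" >> analysis.R
git commit -q -am "Add second variable"

# diff: show what changed between the two commits.
git --no-pager diff HEAD~1 HEAD

# tag: attach a readable note to the current commit.
git tag -a v0.1 -m "First working version"

# branch: start a detour that leaves main untouched.
git branch robustness-checks

git --no-pager log --oneline
git branch --list
```

Remote, gist and README are left out here because they involve GitHub rather than Git alone.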

Git: Terminology (2): actions

  • to commit
  • to merge
  • to fork
  • to clone
  • to push
  • to pull

Git: Terminology (2): actions

  • to commit: create a commit
  • to merge: merge one branch into another branch
  • to fork: create a copy of someone else’s repo in your GitHub account
  • to clone: create your local copy of the repo
  • to push: upload changes from your local to your remote
  • to pull: update local from remote
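
Most of these actions can be rehearsed offline. The sketch below assumes a bare repository in a local folder (`origin.git`) as a stand-in for the remote on GitHub. The folder names, file names and identities are hypothetical. Forking is not shown because it happens on GitHub, not in Git itself.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the remote: a bare repo in a local folder instead of GitHub.
git init -q --bare origin.git
git --git-dir=origin.git symbolic-ref HEAD refs/heads/main

# to clone: create your local copy of the (still empty) repo.
git clone -q origin.git local 2>/dev/null
cd local
git symbolic-ref HEAD refs/heads/main
git config user.name "Workshop Participant"    # hypothetical identity
git config user.email "participant@example.com"

# to commit, then to push: upload the commit to the remote.
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git push -q -u origin main

# A collaborator clones the repo, commits and pushes.
cd "$workdir"
git clone -q origin.git colleague
cd colleague
git config user.name "Colleague"               # hypothetical identity
git config user.email "colleague@example.com"
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
git push -q origin main

# to pull: update your local copy with the collaborator's commit.
cd "$workdir/local"
git pull -q --ff-only

# to merge: develop on a branch, then merge it into main.
git checkout -q -b experiment
echo "y <- 2" >> analysis.R
git commit -q -am "Try a second variable"
git checkout -q main
git merge -q experiment

git --no-pager log --oneline
```

With a real GitHub repository, only the address passed to `git clone` changes; the remaining commands stay the same.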

Publishing Code > README.md

Add a “Usage” and “Contributing” section to your README.md

  • Add a sentence or two on the WHY of the project
  • Add a section “Usage” on how to install/use your project
  • Have a simple and short code example showcasing how to use the project
  • Explain the basic project structure
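
As an illustration of these points, the following sketch writes a README.md skeleton from the shell. The project name, the script path and the folder layout are placeholders, not a required structure.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Write a README skeleton; every name below is a placeholder.
cat > README.md <<'EOF'
# my-project

Why: one or two sentences on the question this project answers.

## Usage

    Rscript code/01_analysis.R

## Project structure

- data/    raw and processed data
- code/    analysis scripts
- output/  figures and tables

## Contributing

Open an issue or a pull request to suggest changes.
EOF

# List the section headings that were written.
grep '^## ' README.md
```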

Publishing Code > Add a License

  • MIT
    • pro: easy to understand and use
    • con: organisations and individuals can use your code without contributing back
  • GPLv3
    • pro: organisations and individuals who distribute software built on your code have to release it, source code included, under the same license
    • con: not as easy; some organisations do not want to use software that obliges them to share their own code
  • Creative Commons
    • pro: Allows you to customize non-commercial or commercial usage and whether it can be used without or with attribution
    • con: the many versions lead to most people not knowing them and ignoring the license

GitHub Desktop: Getting started

GitHub Desktop: track version history

Forking and working on your own repo: exercise 1

https://github.com/lukaslehner/Zurich_2024_workflows_workshop/tree/main

Collaborating on the same repo: exercise 2

https://github.com/lukaslehner/Zurich_2024_workflows_training

Two common errors

  • Push rejected. This can happen if there are changes both on the remote and in your local repo.
    • Solution: Pull first. Resolve any conflict. Then try your push again.

  • fatal: not a git repository. The command cannot be executed because the current directory is not inside a Git repository.
    • Solution: Initialize the repo or change directory to the repo.
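
Both errors can be reproduced safely in a temporary directory. The sketch below again assumes a local bare repository as a stand-in for GitHub. All names and identities are hypothetical. The two sides change different files, so the pull merges without a conflict. A real conflict would need to be resolved by hand before pushing.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Error "fatal: not a git repository": check first, then initialize.
mkdir project
cd project
if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  echo "not a git repository: running git init"
  git init -q
fi
git symbolic-ref HEAD refs/heads/main
git config user.name "Workshop Participant"    # hypothetical identity
git config user.email "participant@example.com"
echo "# Demo" > README.md
git add README.md
git commit -q -m "Add README"

# Stand-in remote (a local bare repo instead of GitHub) and a collaborator.
git clone -q --bare . "$workdir/origin.git"
git remote add origin "$workdir/origin.git"
git clone -q "$workdir/origin.git" "$workdir/colleague"
(
  cd "$workdir/colleague"
  git config user.name "Colleague"             # hypothetical identity
  git config user.email "colleague@example.com"
  echo "notes" > notes.txt
  git add notes.txt
  git commit -q -m "Add notes"
  git push -q origin main
)

# Meanwhile, you commit locally without pulling.
echo "x <- 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Error "push rejected": the remote has a commit you do not have yet.
if ! git push -q origin main 2>/dev/null; then
  echo "push rejected: pulling first, then pushing again"
  git pull -q --no-rebase --no-edit origin main
  git push -q origin main
fi

git --no-pager log --oneline
```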

Some advice before we practice

  • Commit early and often.

  • Push to your remote on GitHub often (but not as often as you commit).

  • Establish a naming convention for commits.

  • Use tags to mark key steps.

  • Fork and clone from foreign repos (instead of “just cloning”)

  • Branch off your development version, especially in teams.

Further Resources

Readings with further information

Bonus: Dealing with pull requests and merge conflicts

Git basics

Generally, git operates through a shell. (Later on, we will install a GUI that can make life easier.)

What is a shell?

A shell (or terminal) is a program on your computer whose job is to run other programs, rather than do calculations itself.

Let’s start by opening the shell in RStudio: Tools > Shell.

A note for Windows users: Git is not included with Windows, so the default Windows shell does not recognise git commands out of the box. We can solve this by installing Git for Windows, which includes Git Bash, a light shell that does support git commands.
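
A quick way to check whether your shell can run git is to ask for the installed version. This is only a sanity check; the version number printed will differ from machine to machine.

```shell
# Prints the installed version if git is available in this shell;
# a "command not found" message means git still needs to be installed.
git --version
```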

Git: Terminology (3): tools

Git is the command line version control system (VCS) software, which works on your local computer.

GitHub is an internet hosting service for git repositories.

GitHub Desktop is an application that enables you to interact with GitHub using a GUI instead of the command line or a web browser.

Bonus: Reproducible research

Reproducibility vs. replication? 🤔

Replicability refers to situations in which a researcher obtains new data to reach the same scientific conclusions as a previous study, whereas reproducibility refers to situations in which the original researcher’s software, code, and data are used to regenerate the results.

Replication standards: guidelines, protocols, and software designed to help researchers share, analyze, archive, preserve, distribute, catalog, translate, verify, and replicate scholarly research data and analyses across disciplines. Includes proposals to improve the norms around data sharing and replication in scientific research.

What hinders reproducible research and what can facilitate it?


Obstacles 🚧

  • Infrastructure and research habits
  • Hardware requirements
  • Operating systems
  • Versions of software and libraries

Solutions ✨

  • Optimised workflows (integrating coding, authoring, version control)
  • Virtual machines for computationally intense analyses
  • Containerisation

Why Open Data?

Efficiency 🏇

Science is not built upon blind trust, but on verifiability: science as “organized skepticism” (Merton, 1947). Only when raw data and other research material are shared can such organized skepticism be put into practice and science correct itself. Open Data is therefore one aspect of good scientific practice.

Data persistence 👴

Reliable infrastructure for storage and publication (e.g., subject-specific repositories, institutional repositories)

Funding requirements 👮

Plan S principle: “from 2021, scientific publications that result from research funded by public grants must be published in compliant Open Access journals or platforms.” (Sherpa Romeo database; fairsharing.org)